There is no r in lasso, but plenty of lasso in R!

Niels Richard Hansen
16/02-2016

Lasso

The lasso acronym means

least absolute shrinkage and selection operator

and was introduced by Robert Tibshirani in

Regression Shrinkage and Selection via the Lasso

published in the first issue of Journal of the Royal Statistical Society: Series B, 1996.

Pronunciation

You say las-SOO

and I say LAs-so.

From Spanish lazo.

Setup

We consider linear regression models of a response \( Y \) on a vector \( X = (X_1, \ldots, X_p) \) of predictors: \[ E(Y) = X^T \beta = \sum_{j=1}^p X_j \beta_j. \]

With \( n \) observations we organize \( Y_1, \ldots, Y_n \) in a vector \( \mathbf{Y} \) and \( X_1, \ldots, X_n \) in an \( n \times p \) matrix \( \mathbf{X} \).

The least squares estimator is \[ \hat{\beta} = \mathop{\arg \min}\limits_{\beta} \|\mathbf{Y} - \mathbf{X}\beta\|_2^2 = \mathop{\arg \min}\limits_{\beta} \sum_{i=1}^n (Y_i - X_i^T \beta)^2, \] where \( \|\mathbf{z}\|_2^2 = \sum_{i=1}^n z_i^2 \) denotes the squared Euclidean norm.
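The least squares estimator can be computed directly from the normal equations \( \mathbf{X}^T \mathbf{X} \hat{\beta} = \mathbf{X}^T \mathbf{Y} \); a minimal base-R sketch on simulated data (the data and names below are illustrative, not from the prostate example):

```r
## Simulated data: n observations of p predictors, no intercept
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1, 0.5)
Y <- drop(X %*% beta) + rnorm(n)

## Least squares via the normal equations: (X^T X) beta_hat = X^T Y
beta_hat <- drop(solve(crossprod(X), crossprod(X, Y)))

## Agrees with lm() fitted without an intercept
beta_lm <- coef(lm(Y ~ X - 1))
all.equal(beta_hat, beta_lm, check.attributes = FALSE)
```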

Prostate cancer data example

  • lpsa: log(prostate specific antigen)
  • lcavol: log(cancer volume)
  • lweight: log(prostate weight)
  • age: age of patient
  • lbph: log(benign prostatic hyperplasia amount)
  • svi: seminal vesicle invasion
  • lcp: log(capsular penetration)
  • gleason: gleason score
  • pgg45: percent gleason scores 4 and 5

Prostate cancer data example

The response, \( Y \), will be lpsa. The remaining 8 variables will be the predictors in \( X \).

     lpsa  lcavol lweight age   lbph svi    lcp gleason pgg45
1 -0.4308 -0.5798   2.769  50 -1.386   0 -1.386       6     0
2 -0.1625 -0.9943   3.320  58 -1.386   0 -1.386       6     0
3 -0.1625 -0.5108   2.691  74 -1.386   0 -1.386       7    20
4 -0.1625 -1.2040   3.283  58 -1.386   0 -1.386       6     0

All variables are standardized upfront.

prostate <- scale(prostate, center = TRUE, scale = TRUE)

Spearman correlations

plot of chunk spearman

A linear model

## Intercept removed due to standardization
prostateLm <- lm(lpsa ~ . - 1, 
                 data = as.data.frame(prostate))
summary(prostateLm)
...
Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
lcavol    0.5762     0.0892    6.46  5.4e-09 ***
lweight   0.2309     0.0741    3.11   0.0025 ** 
age      -0.1370     0.0711   -1.93   0.0571 .  
lbph      0.1216     0.0724    1.68   0.0966 .  
svi       0.2732     0.0860    3.18   0.0021 ** 
lcp      -0.1285     0.1082   -1.19   0.2385    
gleason   0.0308     0.0966    0.32   0.7507    
pgg45     0.1089     0.1061    1.03   0.3072    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.603 on 89 degrees of freedom
Multiple R-squared:  0.663, Adjusted R-squared:  0.633 
F-statistic: 21.9 on 8 and 89 DF,  p-value: <2e-16


A lasso fit using glmnet

## lpsa is column 1 of 'x'; 'exclude = 1' keeps the response out of the fit
prostateLasso <- glmnet(x = prostate, y = prostate[, "lpsa"], 
                        exclude = 1, intercept = FALSE)
plot(prostateLasso, label = TRUE, lwd = 2)

plot of chunk prostateLasso

A lasso fit from Tibshirani's paper

[figure from Tibshirani (1996)]

What is lasso?

Lasso is a regression technique that gives a family of coefficients \( ({}^{s \!}\beta)_{s \geq 0} \) indexed by a tuning parameter \( s \geq 0 \).

  • For \( s \nearrow s_{\mathrm{max}} \) it holds that \( {}^{s \!}\beta \to \hat{\beta} \) (a least squares estimate).
  • For \( s < s_{\mathrm{max}} \) the estimate \( {}^{s \!}\beta_i \) is shrunk toward 0 compared to \( \hat{\beta}_i \).
  • For small enough \( s \) some coefficients are shrunk all the way to 0. This gives lasso the selection property.

L1-constrained regression

[3D plot: the \( L_1 \) ball \( \|\beta\|_1 \leq 1 \)]

\[ {}^{1 \!}\hat{\beta} = \mathop{\arg \min}\limits_{\beta: \|\beta\|_1 \leq 1} \| \mathbf{Y} - \mathbf{X} \beta \|^2_2, \]

where \[ \|\beta\|_1 = \sum_{i=1}^p |\beta_i| \]

L1-constrained regression

[3D plot: the scaled \( L_1 \) ball \( \|\beta\|_1 \leq s \)]

\[ {}^{s \!}\hat{\beta} = \mathop{\arg \min}\limits_{\beta: \|\beta\|_1 \leq s} \| \mathbf{Y} - \mathbf{X} \beta \|^2_2. \]

Lasso defines a family of estimates.

The penalized version of lasso

\[ \beta^{\lambda} = \mathop{\arg \min}\limits_{\beta} \| \mathbf{Y} - \mathbf{X} \beta \|^2_2 + \lambda \|\beta\|_1. \]

The monotonically decreasing, data-dependent map

\[ s(\lambda) = \|\beta^{\lambda}\|_1 \]
from \( (0,\infty) \) to \( (0, s_{\mathrm{max}}) \) gives the relation

\[ \beta^{\lambda} = {}^{s(\lambda) \!}\beta \]
between the constrained lasso and the penalized lasso.

Soft thresholding

If \( \mathbf{X} \) is orthogonal, i.e. \( \mathbf{X}^T \mathbf{X} = \mathbf{I}_p \), then \[ \beta_i^{\lambda} = \mathrm{sign}(\hat{\beta}_i) \max\{|\hat{\beta}_i| - \lambda, 0\} \] is soft thresholding of the least squares estimator.

This can be compared to hard thresholding \[ \hat{\beta}_i 1(|\hat{\beta}_i| > \lambda) \]

and linear shrinkage \[ \frac{\hat{\beta}_i}{1 + \lambda} \] related to ridge regression.
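The three shrinkage rules can be compared directly; a small base-R sketch (the function names are my own):

```r
## Soft thresholding: shrink by lambda, cut at 0
soft <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

## Hard thresholding: keep a coefficient unchanged or kill it
hard <- function(b, lambda) b * (abs(b) > lambda)

## Linear shrinkage, related to ridge regression
linear <- function(b, lambda) b / (1 + lambda)

b <- c(-3, -0.5, 0.2, 2)
soft(b, 1)    ## -2.0  0.00 0.0  1.0
hard(b, 1)    ## -3.0  0.00 0.0  2.0
linear(b, 1)  ## -1.5 -0.25 0.1  1.0
```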

Soft thresholding (\(\lambda = 1\))

plot of chunk threshold

Soft thresholding (\(\lambda = 1.5\))

plot of chunk threshold2

Inspiration and related early work

Computing the lasso solution

For fixed \( s \), the lasso is the solution of a quadratic optimization problem with \( 2^p \) linear inequality constraints.

Tibshirani proposes an iterative algorithm for adding active constraints. It requires the sequential solution of quadratic optimization problems with linear inequality constraints.

Tibshirani implemented public domain functions for S-PLUS.

To obtain them, use file transfer protocol to lib.stat.cmu.edu and retrieve the file S/lasso, or send an electronic mail message to statlib@lib.stat.cmu.edu with the message send lasso from S.

The LARS algorithm

Homotopy methods as in On the LASSO and its Dual, Osborne, Presnell and Turlach, 2000, JCGS, compute the entire solution path \[ \lambda \mapsto \beta^{\lambda}. \]

The least angle regression (LAR) algorithm and its lasso modification (LARS) were developed by Efron et al. (2004) and constitute a fast homotopy algorithm.

prostateLars <- lars(x = prostate[, -1], y = prostate[, "lpsa"], 
                     intercept = FALSE, normalize = FALSE)

This is a pure R implementation!

The LARS fit

plot(prostateLars)

plot of chunk larsPlot

The LARS fit

coefficients(prostateLars)
      lcavol lweight     age   lbph   svi    lcp gleason   pgg45
 [1,]  0.000   0.000  0.0000 0.0000 0.000  0.000  0.0000 0.00000
 [2,]  0.365   0.000  0.0000 0.0000 0.000  0.000  0.0000 0.00000
 [3,]  0.400   0.000  0.0000 0.0000 0.035  0.000  0.0000 0.00000
 [4,]  0.483   0.149  0.0000 0.0000 0.158  0.000  0.0000 0.00000
 [5,]  0.487   0.161  0.0000 0.0000 0.166  0.000  0.0000 0.00917
 [6,]  0.505   0.182  0.0000 0.0443 0.199  0.000  0.0000 0.03388
 [7,]  0.517   0.202 -0.0519 0.0763 0.211  0.000  0.0000 0.05595
 [8,]  0.522   0.213 -0.0812 0.0937 0.219  0.000  0.0107 0.06064
 [9,]  0.576   0.231 -0.1370 0.1216 0.273 -0.128  0.0308 0.10891

Elements of Statistical Learning (ESL)


History

  • ESL first edition was published in 2001.
  • The lars package (a path algorithm) was available from CRAN in May 2003.
  • I read ESL in 2004 and gave a course at UCPH.
  • Least Angle Regression with the LARS algorithm was published in 2004.
  • The glmnet package (cyclic coordinatewise descent) was available from CRAN in June 2008.
  • ESL second edition was published in 2009 introducing me to elastic net and the glmnet package.
  • An Introduction to Statistical Learning with applications in R by James, Witten, Hastie and Tibshirani was published in 2013.

Coordinate descent

The coordinate descent algorithm solves the lasso optimization problem very efficiently.
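A bare-bones sketch of cyclic coordinate descent for the penalized lasso \( \|\mathbf{Y} - \mathbf{X}\beta\|_2^2 + \lambda \|\beta\|_1 \) (not glmnet's implementation, just an illustration of the coordinate-wise soft thresholding update):

```r
## Cyclic coordinate descent for ||Y - X beta||_2^2 + lambda * ||beta||_1.
## Each coordinate update is a univariate problem solved by soft thresholding.
lasso_cd <- function(X, Y, lambda, n_iter = 200) {
  p <- ncol(X)
  beta <- rep(0, p)
  r <- Y                       ## residual Y - X %*% beta, kept up to date
  xx <- colSums(X^2)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      ## Inner product of column j with the partial residual
      z <- sum(X[, j] * r) + xx[j] * beta[j]
      ## Soft threshold at lambda / 2 (convention matching the unsquared loss above)
      b_new <- sign(z) * max(abs(z) - lambda / 2, 0) / xx[j]
      r <- r + X[, j] * (beta[j] - b_new)
      beta[j] <- b_new
    }
  }
  beta
}

## Simulated example (illustrative only)
set.seed(1)
X <- scale(matrix(rnorm(50 * 4), 50, 4))
Y <- drop(X %*% c(3, 0, -2, 0)) + rnorm(50)
round(lasso_cd(X, Y, lambda = 20), 2)
```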

  • Wenjiang Fu proposes in Penalized regressions: the bridge versus the lasso (1998) his “shooting algorithm”.
  • The use of coordinate descent is neglected for a period.
  • Friedman is external examiner in 2006 on a PhD by Anita van der Kooij, who uses coordinate descent, and the algorithm is taken up again.
  • Hastie says that efficiency of glmnet is due to FFT1).
  • FFT = Friedman + Fortran + Tricks.

1) If you don't get the joke look up the fast Fourier transform.

Impact

Prediction of tumor site

Multinomial lasso

The least squares loss is replaced by the negative log-likelihood:

\[ \beta^{\lambda} = \mathop{\arg \min}\limits_{\beta} \ell(\beta) + \lambda \|\beta\|_1 \]

With \( K \) groups and \( p \) predictors there will be \( Kp \) parameters.

In the example we have \( K = 9 \) and \( p = 384 \) giving 3456 parameters.
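The slide does not spell out \( \ell \); in the standard softmax parametrization of the multinomial model it reads

\[ \ell(\beta) = -\sum_{i=1}^n \log \frac{\exp(X_i^T \beta_{k_i})}{\sum_{k=1}^K \exp(X_i^T \beta_k)}, \]

where \( k_i \in \{1, \ldots, K\} \) is the observed group of observation \( i \) and \( \beta_k \in \mathbb{R}^p \) collects the coefficients for group \( k \).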

Multinomial lasso

load("miRNA.RData")
miRNAlasso <- glmnet(Xprim, Yprim, family = "multinomial")
par(mfcol = c(3, 3)); plot(miRNAlasso)

plot of chunk miRNAlasso

Cross-validation

miRNAlasso <- cv.glmnet(Xprim, Yprim, family = "multinomial", 
                        type.measure = "class")
plot(miRNAlasso, ylim = c(0, 0.4))

plot of chunk miRNAcv

Group lasso

Ordinary lasso penalty

Group lasso penalty

Sparse group lasso

Group lasso penalty

Sparse group lasso penalty
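The penalty formulas on these slides were figures; for reference, the standard forms, with the predictors partitioned into groups \( G_1, \ldots, G_m \) and \( \beta_{G_g} \) denoting the subvector of coefficients in group \( g \), are

\[ \|\beta\|_1 = \sum_{j=1}^p |\beta_j| \qquad \text{(ordinary lasso)}, \]

\[ \sum_{g=1}^m \sqrt{|G_g|} \, \|\beta_{G_g}\|_2 \qquad \text{(group lasso)}, \]

\[ (1 - \alpha) \sum_{g=1}^m \sqrt{|G_g|} \, \|\beta_{G_g}\|_2 + \alpha \|\beta\|_1 \qquad \text{(sparse group lasso)}. \]

The \( \sqrt{|G_g|} \) group weights are one common convention, not the only one.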

The msgl package

lambda <- msgl.lambda.seq(Xprim, Yprim, alpha = 0, lambda.min = 0.01)
miRNAmsgl <- msgl.cv(Xprim, Yprim, lambda = lambda, alpha = 0)

plot of chunk miRNAplot

Multinomial group lasso with glmnet

miRNAlasso <- cv.glmnet(Xprim, Yprim, family = "multinomial", 
                        type.measure = "class", 
                        type.multinomial = "grouped")
plot(miRNAlasso, ylim = c(0, 0.4))

plot of chunk miRNAcvmult

Professional R users

A professional cowboy recognizes the layman by his usage of the word lasso.

To the cowboy it is simply the rope.

My next cool statistical method has to be called Rope.

There is R in Rope, and, hopefully, the professional R users will recognize the layman by his usage of lasso - once we have Rope!

Thanks!

Packages

library("glmnet")   ## Lasso (and elastic net) via coordinate descent
library("lars")     ## Lasso via the lar algorithm
library("msgl")     ## Multinomial sparse group lasso
library("reshape2") ## For 'melt'
library("ggplot2")  ## Plotting
library("lattice")  ## Plotting
library("rgl")      ## 3d 
library("misc3d")   ## More 3d